{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "Wd48fv-zDMe6"
      },
      "source": [
        "# Welcome to the [Tensor2Tensor](https://github.com/tensorflow/tensor2tensor) Dataset Colab!\n",
        "\n",
        "Tensor2Tensor, or T2T for short, is a library of deep learning models and datasets designed to make deep learning more accessible and [accelerate ML research](https://research.googleblog.com/2017/06/accelerating-deep-learning-research.html).\n",
        "\n",
        "**This colab shows you how to add your own dataset to T2T so that you can train one of the several preexisting models on your newly added dataset!**\n",
        "\n",
        "For a tutorial that covers all the broader aspects of T2T using existing datasets and models, please see this [IPython notebook](https://colab.research.google.com/github/tensorflow/tensor2tensor/blob/master/tensor2tensor/notebooks/hello_t2t.ipynb)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "FesA0dakI2kh"
      },
      "outputs": [],
      "source": [
        "#@title\n",
        "# Copyright 2018 Google LLC.\n",
        "\n",
        "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "# you may not use this file except in compliance with the License.\n",
        "# You may obtain a copy of the License at\n",
        "\n",
        "# https://www.apache.org/licenses/LICENSE-2.0\n",
        "\n",
        "# Unless required by applicable law or agreed to in writing, software\n",
        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
        "# See the License for the specific language governing permissions and\n",
        "# limitations under the License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "toc",
        "id": "av8U13aqyEdf"
      },
      "source": [
        "\u003e[Welcome to the Tensor2Tensor Dataset Colab!](#scrollTo=Wd48fv-zDMe6)\n",
        "\n",
        "\u003e\u003e[Installation \u0026 Setup](#scrollTo=Urn4QmNfI3hw)\n",
        "\n",
        "\u003e\u003e[Define the Problem](#scrollTo=LUoP57gOjlk9)\n",
        "\n",
        "\u003e\u003e\u003e[Run t2t_datagen](#scrollTo=Q1xBmlrFLSPX)\n",
        "\n",
        "\u003e\u003e[Viewing the generated data.](#scrollTo=MCqJhdnYgiG-)\n",
        "\n",
        "\u003e\u003e\u003e[tf.python_io.tf_record_iterator](#scrollTo=uNpohcPXKsLN)\n",
        "\n",
        "\u003e\u003e\u003e[Using tf.data.Dataset](#scrollTo=6o_1BHGQC5w5)\n",
        "\n",
        "\u003e\u003e[Terminology](#scrollTo=xRtfC0sHBlSo)\n",
        "\n",
        "\u003e\u003e\u003e[Problem](#scrollTo=xRtfC0sHBlSo)\n",
        "\n",
        "\u003e\u003e\u003e[Modalities](#scrollTo=xRtfC0sHBlSo)\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "Urn4QmNfI3hw"
      },
      "source": [
        "## Installation \u0026 Setup\n",
        "\n",
        "\n",
        "We'll install T2T and TensorFlow.\n",
        "\n",
        "We also need to setup the directories where T2T will:\n",
        "\n",
        "*   Generate the dataset and write the TFRecords file representing the training and the eval set, vocabulary files etc `DATA_DIR`\n",
        "*   Run the training, keep the graph and the checkpoint files `OUTPUT_DIR` and\n",
        "*   Use as a scratch directory to download your dataset from a URL, unzip it, etc. `TMP_DIR`"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "both",
        "colab": {},
        "colab_type": "code",
        "id": "IBWBeE39JYaR"
      },
      "outputs": [],
      "source": [
        "#@title Run for installation.\n",
        "\n",
        "! pip install -q -U tensor2tensor\n",
        "! pip install -q tensorflow"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "both",
        "colab": {},
        "colab_type": "code",
        "id": "sbTULiroLs2w"
      },
      "outputs": [],
      "source": [
        "#@title Run this only once - Sets up TF Eager execution.\n",
        "\n",
        "import sys\n",
        "if 'google.colab' in sys.modules: # Colab-only TensorFlow version selector\n",
        "  %tensorflow_version 1.x\n",
        "import tensorflow as tf\n",
        "\n",
        "# Enable Eager execution - useful for seeing the generated data.\n",
        "tf.enable_eager_execution()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "A8JljOzDYF-Z"
      },
      "outputs": [],
      "source": [
        "#@title Setting a random seed.\n",
        "\n",
        "from tensor2tensor.utils import trainer_lib\n",
        "\n",
        "# Set a seed so that we have deterministic outputs.\n",
        "RANDOM_SEED = 301\n",
        "trainer_lib.set_random_seed(RANDOM_SEED)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "ioW-V1qpqSCE"
      },
      "outputs": [],
      "source": [
        "#@title Run for setting up directories.\n",
        "\n",
        "import os\n",
        "\n",
        "# Setup and create directories.\n",
        "DATA_DIR = os.path.expanduser(\"/tmp/t2t/data\")\n",
        "OUTPUT_DIR = os.path.expanduser(\"/tmp/t2t/output\")\n",
        "TMP_DIR = os.path.expanduser(\"/tmp/t2t/tmp\")\n",
        "\n",
        "# Create them.\n",
        "tf.gfile.MakeDirs(DATA_DIR)\n",
        "tf.gfile.MakeDirs(OUTPUT_DIR)\n",
        "tf.gfile.MakeDirs(TMP_DIR)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "LUoP57gOjlk9"
      },
      "source": [
        "## Define the `Problem`\n",
        "\n",
        "To simplify our setting our input text sampled randomly from [a, z] - each sentence has between [3, 20] words with each word being [1, 8] characters in length.\n",
        "\n",
        "Example input: \"olrkpi z cldv xqcxisg cutzllf doteq\" -- this will be generated by `sample_sentence()`\n",
        "\n",
        "Our output will be the input words sorted according to length.\n",
        "\n",
        "Example output: \"z cldv doteq olrkpi xqcxisg cutzllf\" -- this will be processed by `target_sentence()`\n",
        "\n",
        "Let's dive right into our first problem -- we'll explain as we go on.\n",
        "\n",
        "Take some time to read each line along with its comments -- or skip them and come back later to clarify your understanding."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "pDDiPxqg9UF-"
      },
      "outputs": [],
      "source": [
        "#@title Define `sample_sentence()` and `target_sentence(input_sentence)`\n",
        "import random\n",
        "import string\n",
        "\n",
        "def sample_sentence():\n",
        "    # Our sentence has between 3 and 20 words\n",
        "    num_words = random.randint(3, 20)\n",
        "    words = []\n",
        "    for i in range(num_words):\n",
        "        # Our words have between 1 and 8 characters.\n",
        "        num_chars = random.randint(1, 8)\n",
        "        chars = []\n",
        "        for j in range(num_chars):\n",
        "            chars.append(random.choice(string.ascii_lowercase))\n",
        "        words.append(\"\".join(chars))\n",
        "    return \" \".join(words)\n",
        "\n",
        "def target_sentence(input_sentence):\n",
        "    words = input_sentence.split(\" \")\n",
        "    return \" \".join(sorted(words, key=lambda x: len(x)))"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "KcT_x4ma-Uaq"
      },
      "outputs": [],
      "source": [
        "# `Problem` is the base class for any dataset that we want to add to T2T -- it\n",
        "# unifies the specification of the problem for generating training data,\n",
        "# training, evaluation and inference.\n",
        "#\n",
        "# All its methods (except `generate_data`) have reasonable default\n",
        "# implementations.\n",
        "#\n",
        "# A sub-class must implement `generate_data(data_dir, tmp_dir)` -- this method\n",
        "# is called by t2t-trainer or t2t-datagen to actually generate TFRecord dataset\n",
        "# files on disk.\n",
        "from tensor2tensor.data_generators import problem\n",
        "\n",
        "# Certain categories of problems are very common, like where either the input or\n",
        "# output is text, for such problems we define an (abstract) sub-class of\n",
        "# `Problem` called `Text2TextProblem` -- this implements `generate_data` in\n",
        "# terms of another function `generate_samples`. Sub-classes must override\n",
        "# `generate_samples` and `is_generate_per_split`.\n",
        "from tensor2tensor.data_generators import text_problems\n",
        "\n",
        "# Every non-abstract problem sub-class (as well as models and hyperparameter\n",
        "# sets) must be registered with T2T so that T2T knows about it and can look it\n",
        "# up when you specify your problem on the commandline to t2t-trainer or\n",
        "# t2t-datagen.\n",
        "#\n",
        "# One uses:\n",
        "# `register_problem` for a new Problem sub-class.\n",
        "# `register_model` for a new T2TModel sub-class.\n",
        "# `register_hparams` for a new hyperparameter set. All hyperparameter sets\n",
        "# typically extend `common_hparams.basic_params1` (directly or indirectly).\n",
        "from tensor2tensor.utils import registry\n",
        "\n",
        "\n",
        "# By default, when you register a problem (or model or hyperparameter set) the\n",
        "# name with which it gets registered is the 'snake case' version -- so here\n",
        "# the Problem class `SortWordsAccordingToLengthRandom` will be registered with\n",
        "# the name `sort_words_according_to_length_random`.\n",
        "#\n",
        "# One can override this default by actually assigning a name as follows:\n",
        "# `@registry.register_problem(\"my_awesome_problem\")`\n",
        "#\n",
        "# The registered name is specified to the t2t-trainer or t2t-datagen using the\n",
        "# commandline flag `--problem`.\n",
        "@registry.register_problem\n",
        "\n",
        "# We inherit from `Text2TextProblem` which takes care of a lot of details\n",
        "# regarding reading and writing the data to disk, what vocabulary type one\n",
        "# should use, its size etc -- so that we need not worry about them, one can,\n",
        "# of course, override those.\n",
        "class SortWordsAccordingToLengthRandom(text_problems.Text2TextProblem):\n",
        "  \"\"\"Sort words on length in randomly generated text.\"\"\"\n",
        "\n",
        "  # START: Methods we should override.\n",
        "\n",
        "  # The methods that need to be overriden from `Text2TextProblem` are:\n",
        "  # `is_generate_per_split` and\n",
        "  # `generate_samples`.\n",
        "\n",
        "  @property\n",
        "  def is_generate_per_split(self):\n",
        "    # If we have pre-existing data splits for (train, eval, test) then we set\n",
        "    # this to True, which will have generate_samples be called for each of the\n",
        "    # dataset_splits.\n",
        "    #\n",
        "    # If we do not have pre-existing data splits, we set this to False, which\n",
        "    # will have generate_samples be called just once and the Problem will\n",
        "    # automatically partition the data into dataset_splits.\n",
        "    return False\n",
        "\n",
        "  def generate_samples(self, data_dir, tmp_dir, dataset_split):\n",
        "    # Here we are generating the data in-situ using the `sample_sentence`\n",
        "    # function, otherwise we would have downloaded the data and put it in\n",
        "    # `tmp_dir` -- and read it from that location.\n",
        "    del tmp_dir\n",
        "\n",
        "    # Unused here, is used in `Text2TextProblem.generate_data`.\n",
        "    del data_dir\n",
        "\n",
        "    # This would have been useful if `self.is_generate_per_split()` was True.\n",
        "    # In that case we would have checked if we were generating a training,\n",
        "    # evaluation or test sample. This is of type `problem.DatasetSplit`.\n",
        "    del dataset_split\n",
        "\n",
        "    # Just an arbitrary limit to our number of examples, this can be set higher.\n",
        "    MAX_EXAMPLES = 10\n",
        "\n",
        "    for i in range(MAX_EXAMPLES):\n",
        "      sentence_input = sample_sentence()\n",
        "      sentence_target = target_sentence(sentence_input)\n",
        "      yield {\n",
        "          \"inputs\"  : sentence_input,\n",
        "          \"targets\" : sentence_target,\n",
        "      }\n",
        "\n",
        "  # END: Methods we should override.\n",
        "\n",
        "  # START: Overridable methods.\n",
        "\n",
        "  @property\n",
        "  def vocab_type(self):\n",
        "    # We can use different types of vocabularies, `VocabType.CHARACTER`,\n",
        "    # `VocabType.SUBWORD` and `VocabType.TOKEN`.\n",
        "    #\n",
        "    # SUBWORD and CHARACTER are fully invertible -- but SUBWORD provides a good\n",
        "    # tradeoff between CHARACTER and TOKEN.\n",
        "    return text_problems.VocabType.SUBWORD\n",
        "\n",
        "  @property\n",
        "  def approx_vocab_size(self):\n",
        "    # Approximate vocab size to generate. Only for VocabType.SUBWORD.\n",
        "    return 2**13  # ~8k\n",
        "\n",
        "  @property\n",
        "  def dataset_splits(self):\n",
        "    # Since we are responsible for generating the dataset splits, we override\n",
        "    # `Text2TextProblem.dataset_splits` to specify that we intend to keep\n",
        "    # 80% data for training and 10% for evaluation and testing each.\n",
        "    return [{\n",
        "        \"split\": problem.DatasetSplit.TRAIN,\n",
        "        \"shards\": 8,\n",
        "    }, {\n",
        "        \"split\": problem.DatasetSplit.EVAL,\n",
        "        \"shards\": 1,\n",
        "    }, {\n",
        "        \"split\": problem.DatasetSplit.TEST,\n",
        "        \"shards\": 1,\n",
        "    }]\n",
        "\n",
        " # END: Overridable methods."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "HwxQpOKhrolK"
      },
      "source": [
        "That's it!\n",
        "\n",
        "To use this with `t2t-trainer` or `t2t-datagen`, save it to a directory, add an `__init__.py` that imports it, and then specify that directory with `--t2t_usr_dir`.\n",
        "\n",
        "i.e. as follows:\n",
        "\n",
        "```\n",
        "$ t2t-datagen \\\n",
        "  --problem=sort_words_according_to_length_random \\\n",
        "  --data_dir=/tmp/t2t/data \\\n",
        "  --tmp_dir=/tmp/t2t/tmp \\\n",
        "  --t2t_usr_dir=/tmp/t2t/usr\n",
        "\n",
        "```\n",
        "\n",
        "However, we'll generate the data from the colab itself as well -- this is what `t2t-datagen` essentially does."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "Q1xBmlrFLSPX"
      },
      "source": [
        "## Generate the data.\n",
        "\n",
        "We will now generate the data by calling `Problem.generate_data()` and inspect it."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "atYWRpM1FgaJ"
      },
      "outputs": [],
      "source": [
        "sort_len_problem = SortWordsAccordingToLengthRandom()\n",
        "\n",
        "sort_len_problem.generate_data(DATA_DIR, TMP_DIR)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "MCqJhdnYgiG-"
      },
      "source": [
        "## Viewing the generated data.\n",
        "\n",
        "`tf.data.Dataset` is the recommended API for inputting data into a TensorFlow graph and the `Problem.dataset()` method returns a `tf.data.Dataset` object.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "PZczDWnOQDp2"
      },
      "outputs": [],
      "source": [
        "Modes = tf.estimator.ModeKeys\n",
        "\n",
        "# We can iterate over our examples by making an iterator and calling next on it.\n",
        "sort_len_problem_dataset = sort_len_problem.dataset(Modes.EVAL, DATA_DIR)\n",
        "eager_iterator = sort_len_problem_dataset.make_one_shot_iterator()\n",
        "example = next(eager_iterator)\n",
        "\n",
        "input_tensor = example[\"inputs\"]\n",
        "target_tensor = example[\"targets\"]\n",
        "\n",
        "# The tensors are actually encoded using the generated vocabulary file -- you\n",
        "# can inspect the actual vocab file in DATA_DIR.\n",
        "print(\"Tensor Input: \" + str(input_tensor))\n",
        "print(\"Tensor Target: \" + str(target_tensor))"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "1DtfzgqivAxl"
      },
      "outputs": [],
      "source": [
        "\n",
        "# We use the encoders to decode the tensors to the actual input text.\n",
        "input_encoder = sort_len_problem.get_feature_encoders(\n",
        "    data_dir=DATA_DIR)[\"inputs\"]\n",
        "target_encoder = sort_len_problem.get_feature_encoders(\n",
        "    data_dir=DATA_DIR)[\"targets\"]\n",
        "\n",
        "input_decoded = input_encoder.decode(input_tensor.numpy())\n",
        "target_decoded = target_encoder.decode(target_tensor.numpy())\n",
        "\n",
        "print(\"Decoded Input: \" + input_decoded)\n",
        "print(\"Decoded Target: \" + target_decoded)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "xRtfC0sHBlSo"
      },
      "source": [
        "## To be continued ...\n",
        "\n",
        "Stay tuned for additions to this notebook for adding problems with non-text modalities like Images, Audio and Video!"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "collapsed_sections": [],
      "name": "t2t_problem.ipynb",
      "provenance": [
        {
          "file_id": "1FwspR4PzEZAiQCGziob5oov-8DyEXSnw",
          "timestamp": 1533664607636
        }
      ]
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
